In this section we assign a vector of the packages used for this project, along with a helper function that checks whether each package is installed, installs it if necessary, and loads it.
Code
# library chunk
required_packages <- c("dplyr", "readr", "tidyverse", "lmtest", "lubridate",
                       "glmnet", "pls", "readxl", "GGally", "boot", "scales",
                       "ggthemes", "caret")

# Function to check if a package is installed, installing and loading it if not
install_and_load <- function(package) {
  if (!require(package, character.only = TRUE)) {
    utils::install.packages(package, dependencies = TRUE)
    library(package, character.only = TRUE)
  }
}

for (pkg in required_packages) {
  install_and_load(pkg)
}
Gathering data from a variety of sources allowed us to create a comprehensive analysis of housing market trends in New York City. Combining additional non-financial data helps us understand the factors that influence housing prices and crime rates. This analysis can help policymakers, investors, and residents make informed decisions about their financial and housing situations.
Code
# import data sets
median_sale_price <- readr::read_csv("data/median_home_price.csv")
Rows: 21488 Columns: 309
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): RegionName, RegionType, StateName, State, Metro, CountyName
dbl (303): RegionID, SizeRank, 2000-01-31, 2000-02-29, 2000-03-31, 2000-04-3...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 283 Columns: 88
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): RegionName, RegionType, StateName
dbl (85): RegionID, SizeRank, 2018-01-31, 2018-02-28, 2018-03-31, 2018-04-30...
Rows: 895 Columns: 306
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): RegionName, RegionType, StateName
dbl (303): RegionID, SizeRank, 2000-01-31, 2000-02-29, 2000-03-31, 2000-04-3...
Rows: 301 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): FEDFUNDS
date (1): observation_date
Rows: 1312 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): MORTGAGE15US
date (1): observation_date
Rows: 1312 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): MORTGAGE30US
date (1): observation_date
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): MEHOINUSNYA672N
date (1): observation_date
Rows: 24 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): MEHOINUSA646N
date (1): observation_date
Rows: 517 Columns: 126
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): RegionName, RegionType, StateName
dbl (123): RegionID, SizeRank, 2015-01-31, 2015-02-28, 2015-03-31, 2015-04-3...
Rows: 18 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): OFFENSE
dbl (25): 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ...
Rows: 9 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): OFFENSE
dbl (25): 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ...
Rows: 8 Columns: 26
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): OFFENSE
num (25): 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, ...
Rows: 300 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (1): NYUR
date (1): observation_date
2.2 Data Cleaning
Although the source records were well maintained, our data still required extensive cleaning and formatting before they could be merged.
Dates - We had to manipulate the date columns of all data sets to ensure that formats and dates matched, which facilitated merging. Some data sets were stamped on the last day of the month and others on the first; we standardized on the first day of the month, using the lubridate package to shift dates within the same month. Other series, such as median household income and crime counts, were yearly; we distributed those values across the months of each year. While this does not perfectly represent the data, since within-year trends cannot be captured, it lets us keep these series in an overall time series analysis rather than discarding them for their periodicity.
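As a minimal sketch of the two date fixes described above (the column names here are hypothetical, not the project's actual variables): snapping month-end stamps to the first of the month with lubridate, and repeating a yearly value across its twelve months:

```r
library(dplyr)
library(lubridate)

# Hypothetical month-end stamps snapped to the first day of the month
monthly <- tibble::tibble(
  Date  = as.Date(c("2020-01-31", "2020-02-29")),
  value = c(100, 110)
) %>%
  dplyr::mutate(Date = lubridate::floor_date(Date, unit = "month"))

# A hypothetical yearly value repeated across the 12 months of its year
yearly <- tidyr::expand_grid(Year = 2020, Month = 1:12) %>%
  dplyr::mutate(
    income = 75000,
    Date   = as.Date(sprintf("%d-%02d-01", Year, Month))
  ) %>%
  dplyr::select(Date, income)
```

`floor_date` handles leap-year month ends without any manual arithmetic, which is why we reached for lubridate rather than string manipulation.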
Dimensions - Many of the data sets were wide. Once the dates were properly formatted, we pivoted them to long format so that each column maps to a single variable and Date serves as the joining key.
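The wide-to-long pivot can be sketched as follows (a hypothetical one-row Zillow-style slice, with one column per month):

```r
library(tidyr)
library(dplyr)

# Hypothetical wide layout: one row per region, one date column per month
wide <- tibble::tibble(
  RegionName   = "New York",
  `2020-01-31` = 550000,
  `2020-02-29` = 552000
)

long <- wide %>%
  tidyr::pivot_longer(cols      = -RegionName,
                      names_to  = "Date",
                      values_to = "median_sale_price") %>%
  dplyr::mutate(Date = as.Date(Date))  # column names become parseable dates
```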
Missing Values - We removed all NA values from our data set. This led us to drop the variables New Construction Sales and Rental Cost Index, for which there were not enough data points to model thoroughly. Removing these two variables left us with over 300 observations per variable, compared with roughly 70 had we retained them.
Economic Crises - We included dummy variables for the 2008 financial crisis and the COVID-19 pandemic, coded 1 for months falling within a crisis and 0 otherwise. We hope these capture some of the outside shocks to housing prices that would not otherwise be reflected in the model.
# join data sets
ny_housing_data <- ny_mean_sfr_value %>%
  dplyr::left_join(ny_median_sale_price, by = "Date") %>%
  dplyr::left_join(interest_rates, by = "Date") %>%
  dplyr::left_join(mortgage_rate_15_year, by = "Date") %>%
  dplyr::left_join(mortgage_rate_30_year, by = "Date") %>%
  dplyr::left_join(ny_median_household_income_monthly, by = "Date") %>%
  dplyr::left_join(national_median_household_income_monthly, by = "Date") %>%
  dplyr::left_join(total_crime_commissions_monthly, by = "Date") %>%
  dplyr::left_join(unemployment_data, by = "Date") %>%
  dplyr::mutate(housing_crisis = ifelse(Date >= as.Date("2007-12-01") &
                                          Date <= as.Date("2009-06-01"), 1, 0)) %>%
  dplyr::mutate(covid_pandemic = ifelse(Date >= as.Date("2020-03-01") &
                                          Date <= as.Date("2023-05-01"), 1, 0)) %>%
  dplyr::select(-RegionName)

# Convert Date from character to Date class
ny_housing_data$Date <- as.Date(ny_housing_data$Date)

# remove NA values
ny_housing_data_clean <- stats::na.omit(ny_housing_data)

# writing clean data to csv for common usage in future
# commented out for future coding purposes
#readr::write_csv(ny_housing_data_clean, "data/ny_housing_data_clean.csv")
ny_housing_data_clean <- readr::read_csv("data/ny_housing_data_clean.csv")
New names:
• `` -> `...1`
Rows: 280 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (14): ...1, mean_sfr_value, median_sale_price, Fed_Interest_Rate, mortg...
date  (1): Date
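The `...1` name in the output above is readr's repair of an unnamed leading index column that was saved along with the CSV. A minimal sketch of dropping it after import (the file here is a hypothetical stand-in built in a temp directory):

```r
library(readr)
library(dplyr)

# Hypothetical CSV with an unnamed leading index column,
# as e.g. write.csv(..., row.names = TRUE) produces
tmp <- tempfile(fileext = ".csv")
writeLines(c(",Date,median_sale_price",
             "1,2020-01-01,550000",
             "2,2020-02-01,552000"), tmp)

clean <- readr::read_csv(tmp, show_col_types = FALSE) %>%
  dplyr::select(-`...1`)  # drop the auto-named index column
```

Writing with `readr::write_csv()` (which never emits row names) avoids the column in the first place.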
Code
# plot a reduced lm model
lm_model <- stats::lm(median_sale_price ~ ., data = ny_housing_data_clean)
graphics::plot(lm_model, which = 2)
# residuals vs fitted plot
graphics::plot(lm_model, which = 1)
Code
# checking for serial correlation
lmtest::dwtest(lm_model)
Durbin-Watson test
data: lm_model
DW = 0.36236, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Code
# run ggpairs
GGally::ggpairs(ny_housing_data_clean)
3 Modeling
3.1 Model Method 1 - OLS
Since we have concluded that our data suffer from serial correlation, we must transform our variables and lag the data before refitting the linear model.
3.1.1 Model Characteristics
Variables Included
Date
mean_sfr_value
Fed_Interest_Rate
mortgage_rate_15_year
mortgage_rate_30_year
ny_median_hh_income
major_felonies
misdemeanor_offenses
unemployment_rate
housing_crisis
covid_pandemic
Variables Excluded
national_median_hh_income
non_seven_major_felonies
Date.L1
Date.L3
3.1.2 Initial Findings:
Heteroskedasticity - There was evidence of heteroskedasticity in the residuals, as shown by the residuals vs. fitted plot, indicating that the error variance was not constant.
Serial Correlation - Positive serial correlation in the residuals was detected using the Durbin-Watson test, a consequence of the time series nature of the data.
Nonnormal distribution of residuals - The residuals deviated from normality, as seen in the Q-Q plot and histogram, which suggested non-normal distribution.
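The three diagnostics above can be run with standard tests. A minimal sketch on simulated data, constructed so the tests have a known violation to detect (the growing error variance is our assumption for illustration, not a property of the project's data):

```r
library(lmtest)

set.seed(1)
x <- 1:200
y <- 2 * x + rnorm(200, sd = x / 10)  # error variance grows with x
fit <- stats::lm(y ~ x)

lmtest::bptest(fit)                          # Breusch-Pagan: heteroskedasticity
stats::shapiro.test(stats::residuals(fit))   # Shapiro-Wilk: residual normality
lmtest::dwtest(fit)                          # Durbin-Watson: serial correlation
```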
3.1.3 Challenges and Adjustments:
Lagging - We tried transforming and lagging date-related variables, but significant serial correlation persisted. We also experimented with different lag lengths.
Second Model Choice - We will use a Weighted Least Squares (WLS) as an alternative to manage non-constant variance more effectively as our second model method. We will combine this with stepwise regression to further enhance variable selection.
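Note that the `Date.L1`/`Date.L3` columns in the code that follows shift the Date stamp itself; lagging the predictor values is the more common construction. A minimal sketch of the latter with `dplyr::lag()` (the `rate` series here is hypothetical):

```r
library(dplyr)

rates <- tibble::tibble(
  Date = seq(as.Date("2020-01-01"), by = "month", length.out = 6),
  rate = c(1.5, 1.6, 1.7, 1.9, 2.1, 2.4)
)

rates_lagged <- rates %>%
  dplyr::arrange(Date) %>%            # lag() is positional, so order by Date first
  dplyr::mutate(
    rate_L1 = dplyr::lag(rate, 1),    # value from 1 month earlier
    rate_L3 = dplyr::lag(rate, 3)     # value from 3 months earlier
  )
```

The first one and three rows of the lagged columns are NA, which is why lagged models drop those leading observations.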
Code
ny_housing_data_clean <- ny_housing_data_clean %>%
  dplyr::arrange(Date) %>%
  dplyr::mutate(
    Date.L1 = Date %m+% months(-1),  # Lag by 1 month
    Date.L3 = Date %m+% months(-3)   # Lag by 3 months
  )

# regression with lagged variables
lm_model_lag <- stats::lm(median_sale_price ~ ., data = ny_housing_data_clean)
summary(lm_model_lag)
# residuals vs fitted plot
graphics::plot(final_lm_model, which = 1)
Code
lmtest::dwtest(final_lm_model)
Durbin-Watson test
data: final_lm_model
DW = 0.24442, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
3.2 Model 2 - WLS Stepwise Regression
3.2.1 Model Characteristics
Variables Included
Date
mean_sfr_value
Fed_Interest_Rate
mortgage_rate_15_year
mortgage_rate_30_year
ny_median_hh_income
major_felonies
misdemeanor_offenses
unemployment_rate
housing_crisis
covid_pandemic
Variables Excluded
national_median_hh_income
non_seven_major_felonies
Date.L1
Date.L3
3.2.2 Initial Findings
Summary of Findings - The WLS model helped moderate the impact of variance differences. However, the persistent issue of serial correlation needed further examination.
Stepwise Variable Selection - The stepwise regression confirmed the statistical significance of all variables, indicating that they all contribute to the prediction of median sale price.
3.2.3 Challenges and Adjustments
Heteroskedasticity - We will consider using a logarithmic transformation to adjust variables that might be having an imbalanced effect on the model - particularly the yearly data that we distributed monthly.
Code
# calculate WLS model weights
lm_model_wls_weights <- 1 / stats::fitted(
  stats::lm(abs(stats::residuals(lm_model_lag)) ~ stats::fitted(lm_model_lag))
)^2

# fit a WLS model
wls_model <- stats::lm(median_sale_price ~ ., data = ny_housing_data_clean,
                       weights = lm_model_wls_weights)
summary(wls_model)
# fit a null model
wls_model_null <- stats::lm(median_sale_price ~ 1, data = ny_housing_data_clean)

# run a stepwise regression
wls_model_step <- stats::step(wls_model_null, direction = "both",
                              scope = stats::formula(wls_model))
# residuals vs fitted plot
graphics::plot(wls_model_step, which = 1)
Code
lmtest::dwtest(wls_model_step)
Durbin-Watson test
data: wls_model_step
DW = 0.35949, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
3.3 Model 3 - Bootstrap
3.3.1 Model Characteristics
Variables Included
Date
mean_sfr_value
Fed_Interest_Rate
mortgage_rate_15_year
mortgage_rate_30_year
log(ny_median_hh_income)
log(major_felonies)
log(misdemeanor_offenses)
unemployment_rate
housing_crisis
covid_pandemic
Variables Excluded
national_median_hh_income
non_seven_major_felonies
3.3.2 Initial Findings
Summary of Findings - After we performed log transformations on our variables, we found an increase in \(r^2_{Adjusted}\) to 99.66% and a decrease in the standard errors of our coefficients. This suggests that the transformed variables are better suited for modeling, but it also raises the likelihood of overfitting. To address this, we will bootstrap the model to estimate the standard errors and confidence intervals of our coefficients.
3.3.3 Challenges and Adjustments
Overfitting - Because overfitting was likely, we computed bootstrap confidence intervals and standard errors and plotted them. None of the individual predictor CIs crosses the zero line, giving us confidence that the coefficients are not ambiguous and can be applied to this model going forward.
Code
# fit linear model with log(<yearly variables>)
bootstrap_model <- stats::lm(
  median_sale_price ~ Date + mean_sfr_value + Fed_Interest_Rate +
    mortgage_rate_15_year + mortgage_rate_30_year + unemployment_rate +
    housing_crisis + covid_pandemic + log(ny_median_hh_income) +
    log(non_seven_major_felonies) + log(major_felonies) +
    log(misdemeanor_offenses),
  data = ny_housing_data_clean
)

# view summary statistics
summary(bootstrap_model)
Durbin-Watson test
data: reduced_bootstrap_model
DW = 0.35135, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Code
# convert model into function
reduced_bootstrap_model_fit <- function(data) {
  stats::lm(
    median_sale_price ~ Date + mean_sfr_value + Fed_Interest_Rate +
      mortgage_rate_15_year + mortgage_rate_30_year + unemployment_rate +
      housing_crisis + covid_pandemic + log(ny_median_hh_income) +
      log(major_felonies) + log(misdemeanor_offenses),
    data = data
  )
}

# create bootstrap function
bootstrap_function <- function(data, indices) {
  resampled_data <- data[indices, ]                     # resample the data
  model <- reduced_bootstrap_model_fit(resampled_data)  # refit on the resample
  return(coef(model))
}
Code
# set seed for reproducibility
bootstrap_seed <- 45
set.seed(bootstrap_seed)

# bootstrapping
bootstrap_results <- boot::boot(data = ny_housing_data_clean,
                                statistic = bootstrap_function, R = 1000)

# view summary statistics
summary(bootstrap_results)
Length Class Mode
t0 12 -none- numeric
t 12000 -none- numeric
R 1 -none- numeric
data 17 tbl_df list
seed 626 -none- numeric
statistic 1 -none- function
sim 1 -none- character
call 4 -none- call
stype 1 -none- character
strata 280 -none- numeric
weights 280 -none- numeric
Code
# Calculate CIs for each coefficient
for (i in 1:length(bootstrap_results$t0)) {
  ci <- boot::boot.ci(bootstrap_results, type = "perc", index = i)
  print(ci)
}
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL : boot.ci(boot.out = bootstrap_results, type = "perc", index = i)
Calculations and Intervals on Original Scale. The 95% percentile intervals, by coefficient index:

Index   95% Percentile Interval
1       (  52937,  1385222 )
2       (  18.93,    22.48 )
3       ( 0.6752,   0.7418 )
4       (   1823,     6199 )
5       (  27609,    54316 )
6       ( -56806,   -31447 )
7       (  -5999,    -1587 )
8       (   8562,    17141 )
9       (   4496,    22918 )
10      (  40255,   134748 )
11      ( -73908,   -24608 )
12      ( -122227,  -85911 )
Code
# Initialize a data frame to store confidence interval results
ci_data <- data.frame(
  Parameter = names(bootstrap_results$t0),
  Estimate  = bootstrap_results$t0,                  # point estimates from original data
  Lower     = numeric(length(bootstrap_results$t0)), # store lower bounds
  Upper     = numeric(length(bootstrap_results$t0))  # store upper bounds
)

# calculate confidence intervals for each parameter
for (i in 1:length(bootstrap_results$t0)) {
  ci <- boot::boot.ci(bootstrap_results, type = "perc", index = i)$percent
  ci_data$Lower[i] <- ci[4]  # lower bound of the confidence interval
  ci_data$Upper[i] <- ci[5]  # upper bound of the confidence interval
}

# create the plot
ggplot2::ggplot(ci_data, ggplot2::aes(x = Parameter, y = Estimate)) +
  ggplot2::geom_point(size = 3) +  # plot point estimates
  ggplot2::geom_errorbar(ggplot2::aes(ymin = Lower, ymax = Upper),
                         width = 0.2) +  # plot CIs
  ggplot2::theme_minimal() +
  ggplot2::coord_flip() +
  ggplot2::labs(
    title = "Confidence Intervals for Model Parameters",
    x = "Parameter",
    y = "Estimate"
  ) +
  ggplot2::geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggplot2::scale_y_continuous(labels = scales::comma) +
  ggthemes::theme_economist()
Linear Regression
280 samples
11 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 252, 252, 252, 252, 252, 252, ...
Resampling results:
RMSE Rsquared MAE
8546.728 0.9968758 6650.75
Tuning parameter 'intercept' was held constant at a value of TRUE
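The cross-validation results above appear without the code that produced them. A minimal sketch of a comparable 10-fold cross-validated linear model with caret, using the built-in mtcars data as a hypothetical stand-in for the project's data set:

```r
library(caret)

set.seed(123)
ctrl <- caret::trainControl(method = "cv", number = 10)  # 10-fold cross-validation

cv_fit <- caret::train(mpg ~ ., data = mtcars,
                       method = "lm", trControl = ctrl)

cv_fit$results  # RMSE, Rsquared, and MAE averaged over the held-out folds
```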
Based on the comprehensive analysis presented in the “ITEC 621 Project,” which centered on predicting median housing prices in New York City, several key insights and challenges emerged across the various modeling approaches used. The project utilized a substantial dataset compiled from reputable sources such as Zillow, the Federal Reserve, and the New York Police Department, integrating economic indicators like interest rates, housing market values, crime statistics, and socio-economic data, to explore their potential impacts on housing prices.
4.1 Model Assessments
The project applied multiple regression analysis methods to delve deeply into underlying patterns. The Ordinary Least Squares (OLS) model highlighted persistent issues of heteroskedasticity, non-normality in residuals, and serial correlation among residuals. This initial model included variables such as the Federal Reserve Interest Rates and crime statistics but excluded broader income measures, acknowledging their limited explanatory power in the face of local economic conditions.
To address the heteroskedasticity observed, a Weighted Least Squares (WLS) model, enhanced by stepwise regression, was deployed. This adjustment improved the fit by accounting for variance inconsistencies across data, corroborating the significance of all included variables. Furthermore, the refinement depicted how an appropriate model technique could moderate variance-driven distortions, although it still faced the challenge of addressing serial correlation thoroughly.
4.2 Log Transformation and Bootstrapping
To overcome the risk of overfitting associated with high initial R² values in the models, log transformations for variables like median household income and crime rates were conducted. The transformed model revealed an improved Adjusted R² score of 99.66%, suggesting better adaptability to changes in underlying variable distributions. However, to validate model fidelity and mitigate overfitting, a Bootstrapping approach was leveraged, ensuring the robustness of coefficient estimates by calculating confidence intervals and standard errors across multiple resamples.
4.3 Cross-Validation
Finally, cross-validation captured the model’s capability to generalize over unseen data folds, further supporting the model’s stability and predictive capability.
4.4 Final Conclusion
Overall, the analysis underscores the complexity of modeling housing prices, particularly in a dynamic and multifaceted economic environment like New York City. The iterative refinement of models—from OLS to WLS and then to bootstrapping—reinforced the importance of adapting methods to the characteristics of the data in order to yield precise and actionable insights. Despite challenges like serial correlation, which prompt future exploration of time-series models, the project offers a data-driven framework to guide housing-sector stakeholders in applying economic indicators to predictive modeling.